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Abstract 

Los Alamos National Laboratory (LANL) is home to many large supercomput¬ 
ing clusters. These clusters require an enormous amount of power (~500-2000 kW 
each), and most of this energy is converted into heat. Thus, cooling the compo¬ 
nents of the supercomputer becomes a critical and expensive endeavor. Recently a 
project was initiated to investigate the effect that changes to the cooling system in 
a machine room had on three large machines that were housed there. Coupled with 
this goal was the aim to develop a general good-practice for characterizing the effect 
of cooling changes and monitoring machine node temperatures in this and other 
machine rooms. This paper focuses on the statistical approach used to quantify 
the effect that several cooling changes to the room had on the temperatures of the 
individual nodes of the computers. The largest cluster in the room has 1,600 nodes 
that run a variety of jobs during general use. Since extremes temperatures are 
important, a Normal distribution plus generalized Pareto distribution for the up¬ 
per tail is used to model the marginal distribution, along with a Gaussian process 
copula to account for spatio-temporal dependence. A Gaussian Markov random 
field (GMRF) model is used to model the spatial effects on the node temperatures 
as the cooling changes take place. This model is then used to assess the condition 
of the node temperatures after each change to the room. The analysis approach 
was used to uncover the cause of a problematic episode of overheating nodes on 
one of the supercomputing clusters. This same approach can easily be applied to 
monitor and investigate cooling systems at other data centers, as well. 
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1 Introduction 

The cooling of components in high performance computing (HPC) centers is a critical 
issue. Most of the hundreds of kilowatts of energy used to power a large supercomputing 
machine are converted into heat. This heat must be taken away from the components in 
order to prevent overheating and damage. Thus, cooling strategy is a major consideration 
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Table 1; Temperature thresholds and specs for the three compute machines in room 341. 


Cluster 

Thresholds 

Machine Specs 

Warning 

High 

Critical 

# Nodes 

Cores/Node 

Load (kW) 

Mustang 

59° C 

65° C 

70° C 

1,600 

24 

730 

Moonlight 

89° C 

95° C 

100° C 

614 

16 

530 

Pinto 

89° C 

95° C 

100° C 

162 

16 

67 


for a large data center. Los Alamos National Laboratory (LANL) is home to many large 
supercomputing machines. Several machines are generally housed together in a single 
machine room. This paper focuses on machine room 341, which was (as of January 
2014) home to the three computing machines listed in Table 1. In room 341, cool air 
is pumped into the machine room through perforated tiles in the floor. The cool air is 
then sucked into the machine components (e.g., compute nodes) and hot air is blown 
out. The layout of room 341 with these machines circled is provided in Figure 1. A 
project was initiated with the goal of investigating the effect that cooling changes had 
on the machines, while ensuring that the components of these machines are not subject 
to overheating. A further goal was to develop a general good-practice procedure for 
investigating the effect of cooling changes in other machine rooms and for monitoring 
room 341 as layout changes occur (e.g., the installation of a new cluster). This paper 
focuses primarily on the statistical challenges involved in accomplishing this goal. 

Room 341 has 18 computer room air conditioning (CRAG) units, 16 along the sides 
of the room and two more directly left of Mustang in Figure 1. All of the 18 CRAG 
units were operating at the start of the project. The hypothesis from the facilities 
team was that several of the GRAG units could be shut off while still providing the 
pressure necessary to allow adequate airflow to the supercomputer nodes. Also, the 
current cooling supply temperature of 12.8° G was believed to be too conservative, and 
that this could be increased without having much impact on the cooling of the compute 
machine components. Machine room 341 has other components (such as data storage, 
networking, etc.) in addition to the compute machines, but the compute machines are 
the biggest heat producers and the primary concern. For brevity, we focus our attention 
in this paper on the largest of the three machines, the Mustang cluster. The Mustang 
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Figure 1: Layout of Mustang, Moonlight, and Pinto in Machine Room 341. 
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cluster is the most complicated from a statistical modeling perspective, and it is also 
the most problematic of the three machines, from a heat perspective. The other two 
machines have also been modeled with the same approach to be described here. 

As described in Table 1, the Mustang machine has 1,600 compute nodes, spread out 
among 58 compute racks each housing 28 nodes (with the exception of the last compute 
rack which has only four nodes). There are actually a total of 63 racks, but only 58 
of them house compute nodes (the others are empty or contain network and hie system 
components). The racks are laid out as three rows as can be seen in Figure 1 and further 
in Figure 2, which shows the layout of the nodes within each of the racks. 

Each node is aware of its temperature, measured at the CPU, and it will report a 
warning if it exceeds 59° C. A node will report a high temperature warning if it exceeds 
65° C, but allow the current job to hnish. Once the current job is hnished, that node 
would then be removed from the available nodes in the job queue and inspected for any 
hardware issues that may have led to a high temp. If a node reaches 70° C, the current 
job is killed immediately and the node is removed for inspection. Under normal operation 
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Figure 2; Rack and node layout of the Mustang cluster. 
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it is undesirable for nodes to reach 65° C very often, or to reach 70° C at all. 

The plan used to investigate the effect on the machine(s) due to changes to the cooling 
system was to (i) develop a statistical model for node temperatures over time and space 
as a function of cooling supply temperature and other “effects” to the room, e.g., turning 
off CRAG units or the installation of a new cluster, and (ii) use the model to assess 
the current state-of-the-machine and assess the feasibility of another cooling change. For 
this purpose, the current state-of-the-machine was dehned to be a 95% credible bound 
for the maximum temp that would be achieved if launching a new HPL job on the 
entire machine and letting it run for a full day. The HPL job is a compute intensive 
program that performs an LU decomposition of a large matrix and uses it to solve a 
linear system of equations (Dongarra & Luszczek 2011). HPL tests were conducted 
during designated service times (DSTs) so as not to interfere with user jobs. DSTs occur 
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roughly once a month, but tests occurred more on the order of every two or three months 
as it proved more difficult than anticipated to obtain time for experimentation during 
the DSTs. During the designated HPL tests, which typically ran for about two hours, 
node temperatures were collected every minute. HPL is somewhat synthetic in that it is 
not truly representative of a real production job (i.e., a real scientihc computing job) that 
might run on Mustang, but it is representative of a worst-case computation that a node 
might experience as part of a regular user’s production job. The somewhat conservative 
benchmark program, however, removes the user/job variability and allows for a much 
easier identihcation of the changes that nodes may experience due to effects of interest. 

Some of the temperature data from the HPL experiment to establish a baseline for 
Mustang are displayed in Figure 3. HPL was run on all 1,600 nodes, but for ease of 
display, only temperatures from six selected nodes (1,2,919,920,1317,1318) during the 
experiment are plotted here. The temperature time series for neighboring nodes cluster 
together which is not a fluke as there is substantial spatial correlation between node 
temperatures in these data. A video of the node temperatures during the course of the 
baseline experiment along with complete temperature data for all HPL experiments used 
in this paper are available at the journal website. The time points (during an experiment) 
are approximately 1 minute apart, ranging from about 50 seconds to 70 seconds due to 
the timing of the query from the server to the 1,600 nodes and how busy the nodes are 
at that time, etc. Also, the message to the nodes can be lost and the node may not 
report its temperature for that minute; this happened approximately 10% of the time. 
The various experiments to be analyzed in Section 3 were conducted several weeks apart 
from each other, so there are large time gaps in the data as well. 

The node temperature data display both spatial and temporal variation. In general, 
the space and time domains can be either continuous or discrete which leads to four 
different settings (Cressie & Wikle 2011). Much of the methodological development for 
spatiotemporal data has considered the discrete time domain (e.g.. Waller, Carlin, Xia 
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Figure 3: Temperature data from six selected Mustang nodes during the baseline experiment. 



& Gelfand 1997 for discrete space; and Handcock & Wallis 1994; Gelfand, Banerjee 
& Gamerman 2005 for continuous space). Elaborate correlation models have also been 
proposed for continuous space with continuous time (e.g., Gressie & Huang 1999, Gneiting 
2002). Since the nodes constitute a discrete spatial domain and temperature changes 
continuously with time, the setting in this paper corresponds to a discrete space with 
continuous time. This framework is less common, but has been explored by MacNab & 
Gustafson (2007) and Quick, Banerjee, Garlin et ah (2013). The current application and 
corresponding methodology used here are different in that they place an emphasis on 
extreme values. 

As will be seen in Section 3, the node temperature distribution has a very heavy 
upper tail when running HPL. Being able to represent these extremes well statistically is 
critical to the characterization of the state-of-the-machine, thus, we employ methods from 
spatial extreme value analysis (Davison, Padoan & Ribatet 2012). Standard extreme 
value approaches isolate only extremes, either those above a predetermined threshold 
or the maximum of a group of observations (Goles 2005). In contrast, our analysis is 
facilitated by a model for all observations, both extreme and non-extreme. The marginal 
distribution of node temperature is asumed to be a combination of Gaussian for non¬ 
extremes and a generalized Pareto distribution (GPD) for the tail (Frigessi, Hang & Rue 
2002, Garreau & Bengio 2009a, Garreau & Bengio 20096, Reich, Gooley, Foley, Napelenok 
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& Shaby 2013). The parameters are allowed to vary spatially and by experimental 
conditions, providing a means to assess the effect on cooling due to different scenarios 
and help inform cooling strategy. The structure of the node layout is leveraged to develop 
a Markov random held model (Rue 2001, Rue & Held 2005, Li & Singh 2009) for the 
spatial effects to maintain computational feasibility for large machines. Dependence in 
the residual process is also considered. A natural model for extremal spatial and/or 
temporal dependence is the max-stable process model (e.g.. Smith 1990, Kabluchko, 
Schlather & De Haan 2009, Padoan, Ribatet & Sisson 2010, Wadsworth & Tawn 2012, 
Reich & Shaby 2012, Huser & Davison 2013, Huser & Davison 2014, Wadsworth & Tawn 
2014). However, max-stable processes are motivated primarily for analysis of extremes 
alone and computation is tedious for the large datasets we consider here. Instead, we use 
a Gaussian process (GP) copula (see, e.g.. Nelson (1999), for a general review and Sang & 
Gelfand (2009), for an application to spatial extremes) which is computationally efficient 
and demonstrate that this approach is sufficient to capture the important features of our 
data. 

Sophisticated statistical modeling has been used recently to address issues of power 
consumption (Storlie, Sexton, Pakin, Lang, Reich & Rust 2016) and reliability (Storlie, 
Michalak, Quinn, Dubois, Wender & Dubois 2013, Michalak, DuBois, Storlie, Quinn, 
Rust, DuBois, Modi, Manuzzato & Blanchard 2012) of HPG systems. However, this is 
the first attempt to model spatiotemporal node temperatures in supercomputers. This 
model is then used here to identify causes of temperature issues and assess various cooling 
strategies. The rest of the paper is laid out as follows. Section 2 describes the hierarchical 
Bayesian model for the node temperatures. Section 3 provides an in depth analysis of 
the effect that room changes have on the Mustang nodes, and Section 4 concludes the 
paper. This paper also has online supplementary material containing data and Markov 
chain Monte Garlo (MGMG) details. 
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2 Statistical Model for Node Temperatures 

2.1 Model Description 

The proposed model allows the mean node temp to change (e.g., due to supply temper¬ 
ature and/or other room/node change covariates xj) according to spatial random effects 
/3j. There is also a residual process 6 to capture the remaining variation in space and 
time. A thorough assessment of the feasibility of assumptions made in the model be¬ 
low for application to the node temperatures of Mustang can be found in Section 3.7. 
Specifically, the temperature of node s at time t is 

j 

y{s,t) = (3 q{s) + ^(3j{s)xj{s,t) + 5{s,t) (1) 

i=i 

for s G S'} and t G [0, oo), where (3^ = [/9j(l),... ,/Sj(S')]' ~ N{yjJs, Sj), with 

Js a vector of all ones of length S (the number of nodes), and es,t ~ A^(0, represents 
measurement error. The model for the residual process (5(s, t) requires some care. It must 
accurately represent the extremes of the distribution since they will have a large influence 
on the state-of-the-machine. Thus, S{s,t) is assumed to be a dependent process with 
a marginal distribution that can flexibly account for extremes as described below. For 
computational convenience, it will be assumed that the Sg = 5(s, •) process is independent 
of Sg' for s ^ s'. Upon examination of the residuals, this assumption is entirely reasonable 
for this application after accounting for the spatially varying effects, (3js; see Section 3.7. 
Thus, we present a model for time dependent Sg, independent across s, below. If there 
were significant spatial dependency in S, one could consider using product correlation, 
which has nice computational advantages, but some theoretical drawbacks (Stein 2005), 
or use any multivariate method such as a factor or principle components model. 

The tail of a wide class of distributions is well approximated by a GPD (Coles 2005). 
Hence, we let the marginal density of Sg, fs, be the density of a normal distribution that 
switches to a GPD density for the upper tail, i.e., 

fs{y) = <P hy<^^} + [1 - 

8 


( 2 ) 



where (i) 0 and $ are the standard normal density and CDF, respectively, and f]) is 
the GPD density with shape parameter ^ and scale parameter 77 (and threshold parameter 
equal 0). Specihcally, 



Thus fs is assumed normal mean 0, variance until a point, k standard deviation units 

above the mean, at which point the GPD takes over. The parameter 77 can be chosen as 

a function of k to enforce a continuity of fs at kv which is done here. That is, we assume 

1 — <F(k) 

^ 0(k) 

so that fs{Kv) = 

A model is now described that creates a 6s process with marginal distribution fs in 
(2). A common approach to create a dependent process 6s that has a desired marginal 
distribution is to make use of a copula as in Sang & Gelfand (2009) and Reich (2012). A 
copula is a dependent process with uniform marginals, thus an inverse CDF transform 
of a copula will then produce the desired marginals. For example, assume that 

6s{t) = Fi^[^{Zsm, s = l,...,S, (3) 

where Fs{x) = f^^fs(x) is the desired marginal CDF, and Zs is a stationary Gaussian 
process (CP) with a mean 0, variance 1. For any point f, Zs(t) has a standard normal 
distribution, and Us(t) = ^(Zs(t)) necessarily has a Unif(0,1) distribution, i.e., the 
inverse-CDF transform yields the desired marginal distribution for 6s(t). The process 
I7s(t) is a Gaussian-copula and the resulting 6s is a meta-GP (Demarta & McNeil 2005). 
Here we assume that the correlation model for Zg is exponential, i.e., 

Ks = exp{-0\t - t'\}. (4) 

The meta-GP in (3) is a dependent process with flexible tail behavior in the marginals, 
but it is well known that a meta-GP fails to allow for asymptotic dependence in extreme 
values (Frahm, Junker & Schmidt 2005), i.e., x(c) = Pr(Zs(t) > c | Zs(t — 1) > c) —)■ 0 
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as c —)■ cx) for bivariate Gaussian random variables with correlation less than one (Coles, 
Heffernan & Tawn 2013). Demarta & McNeil (2005) recommend the use of a f-copula to 
allow for such tail dependence. However, the Mustang temperature data provided little 
evidence of asymptotic tail dependence; a detailed investigation is provided in Section 3.7. 
Thus, a Gaussian copula was deemed sufficient for this analysis. 

The model in (1) has several (3^ in space where the spatial dimension (for Mustang) 
is 1,600 and can be more than 10,000 nodes for the largest machines at LANL. Since a 
traditional multivariate normal model requires 0{N^) operations for likelihood evalua¬ 
tion and/or realizations, we assume a Gaussian Markov random Field (GMRF) model 
to alleviate computational burden. Each process fSj is assumed to be a GMRF with 
conditionally autoregressive (CAR) representation 


and precision 




Ef=i 


[Var(/3j-, I 


1=1 


( 5 ) 


where Afs,i is the set of neighbors of type / for node s and ns,i = is the number of 
neighbors of type / = 1,... ,L for node s. The neighborhood relationships for Mustang 
(with L = 7) are illustrated in Figure 4, which is a zoomed in version of Figure 2. 
Horizontally neighboring nodes within the same rack are neighbors of type 1 and are 
given an autoregression coefficient of Ai, while vertically neighboring nodes (which are 
not separated by a shelf) are neighbors of type 2. If nodes are horizontal neighbors, but 
with a rack boundary in between them, they are type 3 neighbors. Likewise, if nodes are 
vertical neighbors, but with a shelf in between, they are type 4 neighbors. If nodes have 
a rack of network components in between them, but are otherwise horizontally aligned, 
they are type 5 neighbors. If nodes have the same geography within rows 1 and 2 but are 
directly across aisle 1 from one another, then they are type 6 neighbors. Finally, because 
of the different orientation of the front/back of the nodes and thus cooling in each aisle. 
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Figure 4: Node neighbor relationships for Mustang 
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neighbors between rows 2 and row 3 (across aisle 2) are treated as type 7 neighbors. 


The conditional mean of (3j^s is ^ times a weighted average of the neighbors, where the 
Xi control the relative weights of the neighbor type in the average. For identihability, it 
is assumed that A = [Ai,..., A/,]' ~ Dirichlet(aA) so that the Xi sum to 1. The parameter 
(f ~ Beta(y4<^, B^) acts as a single autoregression coefficient on the weighted average 
of the neighbors. The Xi can also be compared to determine the relative importance 


of the neighbor types in the dependency. The precision is proportional to the number 
of neighbors (and their weights) that go into the weighted average (e.g., nodes on the 
boundary will have a larger conditional variance because they have fewer neighbors) and 
Tj scales the overall precision of f3j. The CAR representation in (5) results in the following 


precision matrix for (3^, 

-TjXi 

if r G Afs^i 


{Qj)r,s = ■* 

7 Eti 

ii r = s 

(6) 


0 

otherwise. 



In order to assure a positive definite (PD) Qj it is common to assume diagonal dominance 
(DD), i.e., < (Q,)..., a sufficient (but not necessary) condition for PD. It 
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can be seen in (6) that Qj is DD if and only if (^ < 1. 

Perhaps a more common formulation (e.g., Reich, Hodges & Carlin 2007) for a CAR 
model with multiple neighbor types than that in (5) would be 

n) 


I (3j _g) fij + 


i P’- 


Ei=i ns,I 


(7) 


[VaT{/3j,s \ f3j,-s)] ^ = ^j^ns,i. 


1=1 


DD in the case of (7) requires J2ins,iPi < J2ins,i, for all s, which is cumbersome to 
impose exactly. Thus, it is commonly assumed that all p; < 1, which satisfies the DD 
condition, but this is much more restrictive than DD. Hence, the formulation in (5) is 
more flexible and can allow more weight on some neighbor types than is possible in (7). 
It is well-documented in the literature, that the pi parameters need to have values very 
close to 1 to capture any reasonable spatial association between neighbors. Indeed, when 
the model in (7) was applied to the node temperature data, all of the pi were very close 
to 1 and the resulting correlations between neighboring nodes of all neighbor types were 
near ~ 0.6. Meanwhile, the I3j in the posterior had much higher (empirical) correlation 
(~ 0.9) for neighbor types 1-4, indicating a poor fit for the GMRF in (7). 

The issue with the formulation in (7) is that in order to get high (> 0.8) correlation 
between nodes of a given neighbor type I', it is not good enough to have pi' near 1, rather 
all of the Pi must be very near 1. However, this implies high correlation in all directions. 
What is missing is the ability to allow some of the pi > 1 which is essentially what the 
formulation in (5) is allowing for in a convenient parameterization. In contrast, the model 
in (5) fit to the node temperature data results in a (p very close to 1, but with distinct 
separation in the A;, and much higher correlation (~ 0.9) between neighbor types 1-4. 
Thus, the model in (5) allows for a more flexible £t and better assessment of the relative 
dependency between neighbor types as seen in Section 3.1. 

In this problem is a 1,600x1,600 matrix that is sparse with only 9,506 nonzero 
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entries. It can be a large computational advantage to work with a the sparse precision 
matrix resulting from a GMRF for likelihood evaluation or conjugate updates of the jSj 
vector (see the Supplementary Material). In this case (after permutation) is a banded 
matrix with a bandwidth of 43 and Cholesky decomposition takes ~0.0006 seconds in R 
(using the Matrix package) as opposed to ~0.7 seconds (over 1,000 times longer) for a 
dense matrix. This decomposition would otherwise be a computational bottleneck as it 
is needed several times during each MCMC iteration. This computational savings will 
be even more critical when extending this approach to other LANL machines (e.g., the 
Trinity machine will have more than 10,000 nodes). 

A summary of the model structure and the assumed prior distributions are provided 
in Table 2. Relatively diffuse priors were used for all parameters, guided by some intuition 
and input from the operations team. For example, the prior for the mean of 13j, implies 
spatial effects are expected to be smaller in magnitude than 10° C on average. The prior 
for if allows for anything in (0,1), but favors larger values. The prior for 6 essentially 
allows anything from correlation ~ 0.99 to negligible correlation for observations 1 minute 
apart (on the same node). The prior for k has a mode of 2 (standard deviation units) and 
a 99*^^ percentile of 5.8, while the prior for ^ is restricted in this case to be positive (i.e., 
heavy tail) with a mode of 0.5 and a percentile of 3. The prior for was constructed 
based on the expert judgment that the error in temperature measurements should have a 
standard deviation of about 0.5° C. Several adjustments (within reasonable ranges) were 
made to the prior distributions to assess sensitivity. No signihcant sensitivity to prior 
specihcation was found insofar as its effect on the posterior of /3j, 6, k, or 

2.2 MCMC Algorithm 

The complete list of parameters in the model described in (1), (4), and (6) is, 

Q = , ( 8 ) 

where (3 = {/3j{s),j = 1,..., J, s = 1, ..., S}, S = {5^(4,„),« = 1, n = 1, ..., A^J, and 
T = [ti, . .. ,Tj]' . The posterior distribution of these parameters is approximated via 
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Table 2; Summary of hierarchical model for node temperature model defined in (1), (4), and 
(6), and the specification of prior distributions. 


Description 

Model 

Prior Distributions 

Specification 

Spatial Effects 

/3/~ A(^,J5,Q-1) 
as in (1) and (6) 

A(M„5|), j = 0,...,J 

Mj = 0 

52 = 100 

Tj'^T{Ar,Br),j = 0,...,J 

Ar = 1 

Br = 0.5 

Xi ~ Dir(aA) 

aA = 

(p ~ 

A^ = 5 

B^ = 1 

Residual Process 

Ss ~ meta-GP 

as in (2) and (3) 

u2~IG(A„,R„) 

A.y = 5 

B^ = 2 

er^T{Ae,Be) 

Ae = 2 

Be = 2 

K ~ T{A^,B^) 

A. = A 

B^ = 2 

e~r(A^,Rg) 

A^ = 2 
B(.=2 

Measurement Error 

es/^N{0,a^) 

cj2~IG(A,,R,) 

A^ = 10 

Ba = 2 


Markov chain Monte Carlo (MCMC). The complete details of the MCMC algorithm, 
including full conditional distributions, etc., are provided in the Supplementary Material. 
However, an overview is provided here to illustrate the main idea. 

The MCMC routine is a typical hybrid Gibbs, Metropolis Hastings (MH) sampling 
scheme (see e.g., Gelman, Carlin, Stern & Rubin 2014). Conjugate updates are available 
for all parameters in (8) with the exception of 5, A, and 6, which require MH updates. 
The A vector was updated via a random walk proposal using a Dirichlet distribution. 
The time correlation parameter 6 was updated using a Gaussian random walk on log 
scale. If the time-varying residuals 5 were assumed to be a GP, then they would have 
the typical conjugate Gaussian update. Of course, with the model in (3), S is no longer 
a conjugate update, nor can it be integrated out analytically. However, a GP conditional 
on “extreme” data has no trouble acting extreme. A GP would just not produce such 
extreme data, unconditionally, which would result in unrealistic realizations of future 
temperatures, and thus the reason for the model in (3). A simple, yet effective approach 
to the MCMC computation is then to use a conjugate Gaussian update (assuming Sg ~ 
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GP{Q,v‘^Ks)) to form a proposal for 5s in a MH update. This approach provided good 
mixing for the analyses in Section 3 with the beneht of no tuning. With the efficient 
GMRF representation for I3j in place, the computational bottleneck becomes the updating 
of 5s- However, the independence assumption over space for 5s easily allows for parallel 
updates of each 5a- On the largest data set analyzed here (~ 700, 000 observations) the 
MCMC algorithm took ~8 hours for 20,000 iterations (1.5 seconds/iteration) on a 24 
core machine with 2.4GHz processors. 

3 Analysis of the Mustang Cluster 

This section provides a comprehensive data analysis of the effect that various conditions 
had on the node temperatures for the Mustang cluster. For brevity, we restrict attention 
to only the Mustang cluster. A similar analysis was performed for the other compute 
machines in Room 341, but as mentioned previously Mustang was the most problematic 
and interesting. The presentation here contains several incremental analyses that follow 
a chronological account of the data analysis as it occurred in practice from August 2013 
through June 2014. Thus, Section 3.1 hrst discusses the analysis of only the data from 
a baseline experiment that was conducted on 8/7/2013 prior to any cooling changes. 
Section 3.2 then discusses the analysis of some subsequent cooling changes that were 
made during the next few months. The new Wolf cluster was installed in the room in 
February 2014; Section 3.3 examines the effect of adding the new Wolf cluster to the 
room and unravels the cause of an overheating epidemic. The results of Section 3.3 
caused a revision to the benchmark HPL program, which is discussed in Section 3.4. 
Some suspicious results in Section 3.3 prompted the examination of the effect of the 
removal of trays within the rack and rack doors in Section 3.5. Finally, Section 3.6 
provides predictions of the current state-of-the-machine. 
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3.1 Baseline State-of-the-Machine 

The model in (1) was fit using only the spatial intercept /Sq to data from a baseline 
experiment that was conducted on 8/7/2013 prior to any cooling changes. That is, any 
Xj covariates that we ultimately investigate in the following sections remained at hxed 
values here. The posterior mean of (p was 0.9998, indicating a strong dependence in 
the overall temperature level between neighbors. The posterior mean of A was A = 
[0.298, 0.241,0.177, 0.194, 0.086, 0.002, 0.002] which can be used directly to assess the 
relative strength of dependence between the various neighbor types, however, correlation 
is a more intuitive and reliable measure for this (Wall 2004). The correlations between 
each neighbor type, resulting from Qj using A (averaged over the correlations between 
all neighbors of the respective type), are 

[0.882 , 0.870 , 0.846 , 0.848 , 0.756 , 0.561, 0.557]. 

Thus, there is strong dependence between neighbor types (1 and 3), and (2 and 4), the 
close proximity horizontal and vertical directions, respectively. However, there is much 
less dependence between neighbor types 6 and 7 (i.e., across aisles). The posterior mean 
value of was 0.955, indicating a correlation of 0.955 between two observations (from 
the latent GP Zg in (3)) one minute apart on the same node. 

The state-of-the-machine is dehned to be an upper 95% credible bound (CB) for the 
maximum temperature achieved (over all nodes) while running HPL continuously for one 
day. This is obtained by producing a posterior predictive sample, m = 1,..., M, of the 
values of temperatures ym{s,t) for another 24 hours using a dense time grid (i.e., every 
minute). And then, for the posterior sample, extract the maximum temperature 
over s and t; denote this maximum Y^. The state-of-the-machine is provided by T 0 . 95 , 
the 95**^ percentile of the Y^. The state-of-the-machine under baseline room conditions 
is 63.5° C, indicating little chance of HPL producing a high temperature (65° C) alert. It 
is also helpful to complement this overall bound (for the maximum temperature of any 
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node) with bounds for the individual node maxima, to locate hot spots. That is, for the 
posterior sample, for each s, obtain the maximum of the ym{s, t) over t; denote these 
maxima Y*^. A 95% upper credible bound for the maximum achieved by node s is the 
95 th percentile of Y*^, denoted 0.95- Figure 5 provides a graphical display of the 0.95- 

3.2 Effect due to Cooling Changes in the Room 

After the baseline experiment, conducted on on 8/7/2013, and prior to 01/08/2014 three 
changes were made to the room. Specihcally, (i) four of the 18 CRAG units that provide 
cool air into the room were turned off. Also, (ii) several cooling tile (i.e., the perforated 
tiles in the floor) changes were made to allow more airflow to certain locations of Mus¬ 
tang. Finally, (hi) the upper band of the temperature controls (that govern the hottest 
air that the CRAG units can supply) was increased. It was widely believed that increas¬ 
ing the upper temperature band would have little effect on node temperatures since the 
supply air from the GRAG units would be much more sensitive to the lower band temper¬ 
atures (which remained unchanged). For a detailed discussion of the supply temperature 
controls, see Michalak, Bonnie, Montoya, Storlie, Rust, Ticknor, Davey, Moxley & Reich 
(2015). 

Ideally all three of the factors listed above would be studied with separate experi¬ 
ments, however, getting in on a DST to obtain experimental data proved more difficult 
than anticipated. Thus, all three effects listed above were confounded in this study. The 
model in ( 1 ) was used to £t the data with only one covariate Xi equal to 1 or 0 to indicate 
whether or not the GRAG unit/cooling tile/upper band changes had been made. Figure 6 
displays the resulting posterior mean of this effect {/3i) for each node, indicating that the 
cooling changes did not make a substantial increase in node temperatures anywhere. The 
largest estimated increase for any node was 1.1° G. In fact, these changes had a negative 
effect overall on node temperatures (i.e., — 1° G averaged across nodes). Some effects 
were as low as —5° G in the vicinity of Racks 15-20, where some of the perforated cooling 
tiles were added. Figure 7 displays 95% upper GBs for the spatial effects, suggesting 
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Figure 5: Upper 95% CBs for the maximum temperatures achieved while running HPL contin¬ 
uously for one day. 



that the largest that the effect might plausibly be for any given node is about 1.6° C and 
that many of the negative effects are signihcant. Thus it is safe to conclude that the 
shutting down of four of the CRAG units did not have a substantial warming effect on 
the Mustang cluster. Surprisingly, the combination of events (most likely the cooling tile 
changes) had a signihcant cooling affect on many of the nodes. 


3.3 Nodes Started Overheating 

In late February 2014, around the time the new Wolf cluster was installed in the room, 
reports of Mustang nodes overheating (> 65° C) began coming in daily. Some nodes were 
approaching 70° C. The prominent theory was that the airhow around Mustang must 
have been substantially affected by the addition of Wolf. The heavy majority of jobs that 
caused the critically high node temperatures belonged to three users. It was determined 
that all three of the users had been running HPL jobs for performance testing purposes. 
However, the parameters of the HPL job they were using had been “optimized” to provide 
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Figure 6: Posterior mean of /3i: The effect on temperature for each node after adjustment of 
cooling tiles, CRAC units, and upper temperature band. 
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Figure 7: Upper 95% CBs for /3i. 
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Figure 8: Posterior mean effect due to the installation of Wolf (and any room changes that 
happened along with it). 
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used by the cooling team as HPLl. An experiment was run on 04/09/2014 to assess the 
state-of-the-machine and attempt to uncover the cause of the overheating nodes. It was 
considered unlikely at that time that HPL2 could be the cause of a ~ 10° C increase 
in node temperatures. However, HPLl and HPL2 were both run (for 30 minutes each) 
during the experiment. Thus, at this point there are now three covariates in the model: 
Xi - previous cooling changes effect (0 or 1 ), X 2 - presence of wolf cluster (0 or 1 ), and X 3 
- Running HPL2 instead of HPLl (0 or 1). 

Figure 8 displays the effect that the addition of the Wolf cluster (and the airflow 
changes that came along with it) had on the Mustang node temperatures. On average 
(across nodes) this affect is not large, but in some node locations it is close to 4° C. Still, 
considering the baseline individual bounds in Figure 5 (that had remained qualitatively 
unchanged by the cooling changes thus far), it is not possible that Wolf was causing the ~ 
10° C increases to node temperatures that were being observed in practice. Interestingly, 
the effect of Wolf appeared to be abruptly different, spatially, for Racks 44 and 58. 
These racks appeared to be running about 3° C cooler than their respective neighboring 
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Figure 9: Posterior mean effect due to HPL2 being run instead of HPLl. 
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racks. After some investigation, it was discovered that the operations team had removed 
the support trays from these racks and left their doors open at some time prior to the 
experiment. They had speculated that this may have an effect on node temperature. It 
turns out that it does; more on this in Section 3.5. 

Figure 9 displays the posterior mean effect on node temperature due to HPL2 being 
run instead of HPLl. This plot basically tells the overheating story in a nutshell: HPL2 
runs 8.5° C hotter on average and as much as 13° C hotter on some nodes than HPLl. 
While the mystery of overheating nodes was solved, the prior understanding was that 
HPL was the most computational intensive job that a cluster might see. But it was 
not understood that there are substantially different shades of HPL. This prompted the 
cooling team to rethink the appropriateness of the current HPLl benchmark. 


3.4 A New Benchmark HPL 

Should HPL2 be used as the benchmark instead of HPLl, or is HPL2 too conservative? 
What do “extreme” node temperatures on Mustang look like under regular, production 
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Figure 10: (a) Densities of the 95*^ percentile of the hottest 40 users (magenta) versus HPLl 
(blue). Mean difference is about 3° C. (b) Density of the pairwise differences (by node) between 
the 95**^ percentile of the hottest 40 users to HPLl. 

(a) (b) 

HPL1 verus 'Extreme' Users Difference from 'Extreme' Users to HPL1 




use? Temperature records during regular use are recorded ~every hour in a database for 
most of the LANL machines (including Mustang). Thus, these records were used, along 
with job history records for the previous four months to identify the 40 “hottest” users 
of Mustang. An extreme-user-dataset was created that had all temperature readings for 
any node that ran one of the 40 hottest users’ jobs. The empirical 95*^ percentile of the 
extreme-user-dataset was calculated by node and the resulting density (across nodes) is 
displayed in Figure 10(a). For comparison, the density (across nodes) for the empirical 
95**^ percentile temperature achieved while running the HPLl benchmark during the 
same time frame is also displayed as the dashed curve in Figure 10(a). The density of the 
pairwise differences (between nodes) of these 95**^ percentiles (extreme users minus HPLl) 
is displayed in Figure 10(b). It is apparent that the HPLl benchmark was not intensive 
enough. Typical use by the hottest users of Mustang results in node temperatures that 
are about 3° C hotter on average than that of HPLl. However, it is also clear that HPL2 
(which is 8.5° C hotter than HPLl on average) would be far too conservative. 

It was decided that a new benchmark should be constructed that more closely resem¬ 
bled the regular hottest users of Mustang. Thus, an experiment was conducted varying 
the parameters of HPL, to identify the appropriate setting. The two HPL parameters 
varied were N - the size of the matrix, and block size - the size of the tasks sent out 
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Figure 11: (a) Average (across sampled nodes) HPL temperature difference from HPLl achieved 
while varying N and Block Size. Surface estimated with smoothing splines. The setting for 
HPLl is given by the magenta point. Block Size of 5 should provide ~ 5° C increase, (b) 
Distribution of the pairwise differences (by node) between the 95**^ percentile of the hottest 40 
users to HPLl and the distribntion (over nodes) of HPL3 effect (relative to HPLl). 

(a) (b) 



Difference to HPL1 



to each worker. The design was chosen by a Latin-hypercube sample of 25 (with one 
rnn at the HPLl settings, N = 12,800 and block size =1). A DST was not available 
to conduct this experiment, so many regular jobs (each consisting of the 26 HPL runs) 
were submitted to the standard queue to get coverage of the nodes. In all, the 26 HPL 
settings were run on ~ 900 nodes with nearly uniform coverage over space. The resulting 
response surface of the average temperature (across nodes), provided in Figure 11(a), was 
estimated via an additive smoothing spline with a multiplicative interaction term using 
the generalized additive model framework (Hastie & Tibshirani 1990). The HPLl design 
point with block size = 1 and N = 12, 800 is plotted for reference. Node temperature is 
not sensitive to N, but very sensitive to block size, particularly for small values of block 
size. 

It was determined that HPL with block size = 4 and N = 12, 800 (hereafter referred 
to as HPL3) would provide ~ 4° C increase on average across nodes. This setting was 
then run during the next DST (discussed in detail in Section 3.5) to conhrm this hnding 
on the entire Mustang cluster. Figure 12 displays the posterior mean effect across nodes 
of running HPL3 instead of HPLl after htting the model in (1) with these new DST 
data. Figure 11(b) displays the density of the pairwise differences between the extreme 
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Figure 12: Posterior mean effect due to HPL3 being run instead of HPLl. 


Temp 
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users on Mustang to HPLl again, along with a histogram of the posterior mean of the 
node effects due to HPL3. HPL3 is if anything slightly conservative, as desired. 


3.5 Effect of Removing Trays and Doors 

As mentioned in Section 3.3, the effect of Wolf appeared to be abruptly different spatially 
for Racks 44 and 58 in Figure 8 due to the fact that the operations team had removed the 
support trays from these racks and left their doors open. This prompted the cooling team 
to investigate the effect of tray removal from a rack and that of opening doors. Thus on 
06/11/2014 an experiment was conducted to assess the affect of tray and door removal, 
and also to conhrm the effect of the new benchmark HPL3. As it requires a nontrivial 
amount of time to remove trays, the trays were removed prior to the experiment on all 
even numbered racks and remained out for the entirety of the experiment. During the 
experiment the following conditions were tested: (i) HPLl with all doors closed and trays 
out of even racks, (ii) HPLl with even numbered doors opened and trays out of even racks, 
and (iii) HPL3 with even numbered doors opened and trays out of even racks. Only the 
even numbered rack doors where opened since the operations team decided it would be 
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Figure 13: Posterior mean effect due to the removal of trays while running HPL3. On average 
(across nodes) this affect is a ~3.5° C reduction. 
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difficult logistically to open all doors at once. Obviously, a more comprehensive factorial 
design would be preferred, but as before, only a very limited amount of time was available 
for experimentation on the Mustang cluster. Even still, the effect of the removal of doors 
and trays can be estimated for each node due to the spatial correlation of effects in model 
(1). After inspection of the effect plots, there was some clear interaction between tray 
removal and the HPL3 effects. Thus, the present model includes the following covariates: 
Xi - previous cooling changes effect (0 or 1 ), X 2 - presence of wolf cluster (0 or 1 ), xs - 
Running HPL2 instead of HPLl (0 or 1), 2:4 - Running HPL3 instead of HPLl (0 or 1), x^ 
- Trays in or out (0 or 1), xq- doors closed or open (0 or 1), and x^ - Interaction of running 
HPL3 and trays out {X 4 X 5 ). For clarihcation, in the present analysis, racks 44 and 58 also 
now use x^ = xq = 1 when using the experimental data from the previous experiment 
04/09/2014. This was intentionally not the case in the analysis of Section 3.3 in order 
to preserve the natural order of analysis and discovery. 

Figure 13 displays the posterior mean effect due to tray removal while running HPL3. 
This effect is —2.8° C on average across nodes and as much as —5° C for some nodes. 
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The posterior mean effect of opening doors is not nearly as pronounced (only —0.1° C on 
average across nodes). Based on 95% lower CBs, the most beneficial the effect of opening 
doors could be for any given node is —0.9° C. Thus, the cooling benefit that was noticed 
on racks 44 and 58 in Figure 8 must have come from tray removal. 

3.6 Assessing the State-of-the-Machine 

Finally, it is helpful to assess the current state-of-the-machine after changes and to high¬ 
light any potential problem areas of nodes. The current state-of-the-machine with HPL3, 
Wolf in the room, trays out, and doors closed, using the posterior distribution resulting 
from the model fit in Section 3.5 is 64.1° C. This is only slightly hotter than the baseline 
state-of-the-machine (63.5° C), still below the high (65° C) threshold and well below the 
critical (70° C) threshold. Figure 14 displays the corresponding individual upper 95% 
CBs on the node maxima. Compared to the those from baseline in Figure 5, the pre¬ 
dicted maximum temps are a bit hotter in some areas and lower in others, with the most 
problematic nodes now coming from Rack 23. 

3.7 Assessing Model Assumptions 

It is prudent to assess some of the key modeling assumptions underlying the analysis and 
conclusions in the preceding sections. In order to investigate some of these assumptions, 
the fitted (i.e., posterior mean) values S were obtained. Since the state-of-the-machine is 
heavily dependent on extremes, it is important to ensure that S is accurately represented, 
particularly in the tail. Figure 15 provides a histogram and normal Q-Q plot for the 
Ss(t), for all values of s and t. If we had assumed a GP for each Sg, we would expect 
the marginal distribution of 6s{t) to be N{0,v‘^), however, this is clearly not the case. A 
nonparametric logspline density estimate (Kooperberg & Stone 1991) is provided along 
with the Normal-|-GP density fit (and corresponding Q-Q plot) using the posterior mean 
values of = 0.95, k = 1.66, and ^ = 0.120. 

The S process model appears to allow for the correct tail behavior, marginally. 
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Figure 14: Upper 95% CBs for the max node temperatures that would be achieved by running 
HPL3 continuously for one day with Wolf in the room, trays out, and doors closed. 
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Figure 15: Normal, Normal+GPD, and nonparametric estimated marginal densities for 6. 
(a) Histogram and density estimates for 5. (b) log-density estimates for 5. 




(c) Gaussian Q-Q plot for <5. (d) Gaussian -|- GPD Q-Q plot for S. 
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but the properties of the assumed dependency in space and time need to be investi¬ 
gated as well. Figure 16(a) provides a plot of the correlation in the underlying GP 
Z{s,t) = f))] as a function of time lag, estimated empirically from Z{s,t) 

and according to the posterior mean estimate of the exponential model in (4). While 
there may be some evidence to suggest the correlation structure of Z{s, ■) decays slightly 
differently than exponential, the exponential correlation model is a very reasonable ap¬ 
proximation. As mentioned above the meta-GP does not allow or for dependence in 
extreme values, i.e., x(c) = Pr(Z(s,t) > c | Z{s,t — 1) > c) —)■ 0 as c —)■ cx). However, 
in Figure 16(b), the empirically calculated x(c) (aggregated over all s and t) is plotted 
across c. The theoretical x(c) under a bivariate Gaussian assumption along with 95% 
simultaneous bounds (under the Gaussian assumption) on the empirically calculated x(c) 
are also provided for comparison. The theoretical x(c) decays to zero and the empirically 
calculated values appear to be decreasing as well and are within the conhdence bounds. 
This implies that the data are not able to inform us of signihcant tail dependence in 
this case, i.e., there is not evidence to suggest that a Gaussian copula is insufficient for 
this analysis. If this were not the case, a possible extension would be to make use of a t 
copula as described in Demarta & McNeil (2005), for example. 

The model in (1) also assumed no remaining spatial dependence in 6 after accounting 
for the spatially varying /3j coefficients. Figure 16(c) displays the correlation in Z{s,t) 
as a function of spatial distance (i.e., a scaled covariaogram). Distance for this purpose 
was dehned as Euclidean distance from the nodes as they sit in the machine room. It 
is clear from Figure 16(c) that while spatial correlation may be signihcantly different 
from zero, it is negligible in the residual process. Even still, negligible spatial correlation 
does not necessarily rule out the possibility of asymptotic tail dependence. Figure 16(d) 
displays the analog of Figure 16(b) over space, with x*(c) = Pr(Z(s,t) > c | Z{s',t) > c) 
estimated empirically, where s' is the closest node to s. There is clearly no evidence of 
spatial tail dependence and it seems fairly safe to ignore spatial dependence in 6 for this 



Figure 16: Space-time dependence in 6. 


(a) Coic{Z{s,t), Z{s,t — h)) as a function of (b) Tail dependence between time neighbors 
the time lag h. S(s, t) and 6{s,t — 1). 




(c) Cor:{Z{s,t), Z{s',t)) as a function of 
Euclidean distance lls — sT. 
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analysis. Again, if there were non-negligible spatial dependence in 5, a simple extension 
would be a separable space-time model or one could use any multivariate method such 
as factor analysis or principle components, etc. 

Finally, we conclude this section by comparing the results above to those from an 
analysis using a simple mixed effects ANOVA, i.e., each (3j{s) ~ A^(/ij,n|), with con¬ 
tinuous AR(1) (i.e., exponential correlation) residuals performed with the Ime function 
from the nlme package in R. We are not suggesting that this particular analysis would 
be appropriate here as the residual process has a much heavier right tail than the normal 
distribution and there is substantial spatial correlation in the regression coefficients Pj{s). 
This is simply a means to see what may have been gained from the more complex model. 
Ignoring these model violations would lead to bias in the f3j estimates and the variability 
involved. Also, this model still took over an hour to converge for this dataset, so there 
is not a lot of gain from a computation time perspective. Some of the estimated (3j{s) 
coefficients from the simple mixed effects model are quite different than those from the 
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proposed model, particularly when there were missing data for some nodes under some 
conditions. This was most noteable for the effect due to trays and HPL3. The spatial 
dependence should allow for a more reliable estimation in such cases. Many of the esti¬ 
mates are qualitatively similar, though, and the overall conclusions about the effects due 
to Wolf, etc., would not be appreciably different. 

However, a large concern here is that the goal is to evaluate the state-of-the-machine 
based on an upper prediction bound for the temperatures that could be achieved. Thus, 
the model violation for the upper tail behavior of the residual process is problematic. 
The predicted state-of-the-machine using the mixed effects ANOVA model is only 59.8°C, 
which is 4.3° C cooler than that predicted with the proposed model. This is to be expected 
in light of Figure 15, as a Gaussian assumption for S is clearly going to produce overly 
optimistic (smaller) extreme values. In any case, this difference is substantial and could 
easily lead to a poor decision about cooling strategy. Thus, it is far preferable for this 
and similar problems to properly account for the tail behavior of the residuals as in (3). 

4 Conclusions & Further Work 

An analysis framework for assessing the affect that room changes have on node tem¬ 
peratures in supercomputers has been presented. This framework was used to assess 
the effect of cooling system changes and monitor machine room 341 at LANL. The case 
study represents a general good-practice for characterizing the effect of cooling changes 
and monitoring node temperatures in other data centers as layout changes occur. The 
framework developed here follows a spatial linear model framework accounting for non- 
Gaussian heavy tails of the error process via a Gaussian copula and a combination of 
Gaussian and generalized Pareto for the marginal distribution. This analysis framework 
was used to assess the state-of-the-machine after several changes to cooling and uncover 
the cause of a mysterious overheating episode. This same framework can also be easily 
applied to other (larger) data centers as well. For a given bandwidth of the (permuted) 
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precision matrix, the MCMC algorithm is linear in the number of nodes, and linear in the 
number of time points per node. Of course as the number of nodes changes, the layout 
of nodes has to change and this will likely affect the achieved bandwidth, so in practice 
the relationship is not that simple. Still, based on preliminary testing, the analysis of a 
20,000 node machine with layout similar to Mustang and 200 time points per node would 
take ~ 20 hours for 20,000 MCMC iterations on a 48 core machine. 

The experimental data used were collected during DSTs. However, it can be difficult 
to obtain time on a machine during a DST to conduct experiments. Thus, an alternative 
approach to gather data is being explored that runs test jobs on nodes when they are 
not being used by other (regular user) jobs. The drawback is that it may take some time 
(~ 1 week) to cycle through all or most nodes in the cluster in this manner. On the other 
hand, data could then be collected (at least on part of the machine) nearly continuously. 

Very little attention was given in this paper to the question of design of the cooling 
system for optimization of cost, subject to an acceptable level of node temperatures. 
For example, which combination of CRAG units should be used to result in the most 
efficient cooling of the machines? What is the optimal cooling supply temperature to 
use? The analysis framework needed to answer these questions, has been developed here, 
but specihc answers to these questions is a subject of future work. 
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Supplementary Material: “Spatiotemporal Modeling 
of Node Temperatures in Supercomputers” 

A MCMC Algorithm and Full Conditionals 

This section describes the MCMC sampling scheme for the full model described in Sec¬ 
tion 2.1 of the main paper. The entire collection of parameters to be sampled in the 
MCMC is 

(Al) 

The MCMC algorithm proceeds with Gibbs updates for each of the elements of 0 with 
the exception of A = [Ai,..., A^]', and 9, which are updated via a Metropolis Hastings 
(MH) step. Full conditional distributions with which to perform the Gibbs updates are 
provided below for all of the parameter groups listed in (Al) except for S, A, if, 9, n, 
and in which case the specihcs of the MH step is described instead. 

(3 I rest 

Dehne the collective observation vector for all nodes and all times asy = [|/(1,1), |/(1, 2),..., y{S, T)]' 
and similarly dehne the discrepancy vector S = [5(1,1), 5(1, 2),..., 5(5*, T)]'. Let the re¬ 
mainder to y after subtracting off 5 be dehned as 

y = y-6 = Xf3 + e 

where f3 = [/3o(l),/do(2), • • • ,/3j(A)]'. The design matrix X is sparse and can be written 
as 

X = (Xo|Xi|---|Xj) 

where each of the Xj are block diagonal, e.g., Xq = Is® Jt with Is the S x S identity 
matrix and Jt a vector of all ones of length T. Therefore (3 conditional on all other 
parameters and the data reduces to a linear model with variance of the errors known 
and normal prior on the coefficients. That is, a priori (3 where = 


1 



• • • ,/ij]' ® Js and 


Qo 0 0 

, 0 Ri 0 

= Qi3 = . . . ) 

\ 0 0 Qj j 

and Qj from (6) is the sparse S x S precision matrix for f3j = [/dj(l), • • • ,(3j{S)]' for the 
current values of A and Tj. 

The normal distribution is well known to be conjugate in this setting (Gelman et ah 
2014) and thus 

f3 I rest ~ N{m, V) 

where 

V-‘ = IjV'X + Qg 

and 

V~^m= ^X'y + . 

The resulting precision matrix V~^ is sparse and efficient means to perform the matrix 
arithmetic above and generate the multivariate normal for (3 exist for such cases (e.g., 
see Rue & Held (2005) and the Matrix package in R). 

MH update for 6 

Since there is no spatial dependence, each of the 6s can be updated independently (and 
in parallel). If spatial dependence were required, the same approach as below would 
work with the product correlation for space/time and the Kronecker structure could 
be leveraged for computational efficiency. The MH proposal Sg for each s is drawn 
independently of the current value of Sg from a conjugate normal update, assuming a 
priori that 6g ~ N{0^v‘^Rg^), where Re is the inverse of the Tg x Tg time correlation 


2 



matrix whose (i, j) element is exp{—9\ts^i — tsj\}, for observed time points of ■ ■ ■, 
on node s. 

Let be only the elements of y in (A2) corresponding to node s, and similarly for 
Sg and Xg. The remainder to y after subtracting off Xg(3 is 

y^ = y^- Xgf3 = Sg + Sg. (A2) 

Therefore under a normal prior, conditional on all other parameters and the data 
would reduce again to a linear model with variance of the errors known and normal prior 
on the coefficients. The normal distribution is again conjugate in this setting and thus 
the proposal for the MH step is provided by 

S:r^N{mg,Vg), 


where 

Vg^ = H— i^Re, 

and 

V-^rrig = ^y^ 

Let the (multivariate normal) density of this proposal be denoted d{5g)*. Once again, 
sparse banded matrix operations make for the efficient evaluation of d{5g)*. The only 
portion of the full model likelihood that differs between the current value and the proposal 
is <5s, (T^), which is simply iid normal mean Sg and variance Recall the actual 
prior for <5^ is the Normal + GPD model marginal with a GP copula, dehned in (3). Let 
7r(5s) denote the density of this prior distribution which relies on a multivariate normal 
density evaluation (again with a sparse precision). The MH ratio for the Sg update is 
then 

C{y,-,Sg,a^)7r{Sg)d{S:)- 


3 



fij I rest, j = 1,..., J 

(3j ~ N{JsHj, Qj^), so that (3j = JsfJ'j + where cj ~ -/V(0, Qj ^). Let Qj = L'L be 
the (sparse) Cholesky decomposition. Then set, 


= LJsf^j + Luj, 


where Lu ~ N{0, 1). Thus, this is now a simple linear model update of a single regression 
coefficient fij with known residual variance and design matrix LJs- A priori, fij ~ 
N{Mj, Sj), which is conjugate, leading to the following independent updates for each j, 


fij I rest ~ N{mj,Vj), 


where 

and 

Tj I rest, j = 1, ■ ■ ■, J 

f3j ~ N{Jsfij,Q~^). Then set, 

7 , = T(/3,-/x,)*A"iV(0,r2). 

A priori, tj ~ r(AT-, Br), which is conjugate, leading to the following independent updates 


~ ^2 A JsQjJS 


( ^ + 


4 



for each j, 


rest ~ r 




MH update for A 

As mentioned in the main paper, A was updated via a MH random walk proposal using 
a Dirichlet distribution. The random walk was conducted by drawing a proposal A* ~ 
Dirichlet(A/s) for a scale (tuning) parameter s. The tuning parameter was set to s = 
0.015 to achieve an acceptance rate ~40%, and resulted in good mixing. Let the density 
of the proposal, given the current value of A be denoted d{X* \ A). The only portion 
of the full model likelihood that differs between the current value and the proposal is 
n)=oterm of which is a multivariate normal density with mean 
vector JsfJ^j and precision matrix Qj. Evaluation of each of these densities is again made 
efficient by the sparsity of Qj. The MH ratio is then 



7r(A*)d(A 

A*) 


7r(A)d(A* 

A) 


where 7r(A) is the density for a Dirichlet(aA) random vector. 

MH update for ip 

The update for (p was conducted via a MH random walk proposal on logit scale. The 
random walk was conducted by drawing a proposal logit ( 99 *) = logit (99 + e) for a deviate 
e ~ iV(0,s^). The tuning parameter was set to s = 0.05 to achieve an acceptance rate 
~40%, and resulted in good mixing. Let the density of the proposal, given the current 
value of (p be denoted d{(p* \ ip). The only portion of the full model likelihood that differs 
between the current value and the proposal is again £(/3^; Tj, A, 99 ). The MH ratio 


5 



is then 


MH 






TT{(p)d{(p* 

<d) 


where 71 ( 9 ?) is the density for a Beta(A(^, B^) random vector. 


MH update for v'^ 

The MH update for v"^ is a random walk conducted on the log scale, i.e., log(t;^*) = 
log(t;^ + e) for a deviate e ~ iV(0,s^). A value of = 0.003 was used to achieve an 
acceptance rate of ~40%. Let the density of the proposal, given the current value of v'^ 
be denoted diy"^* \ v‘^). The only portion of the full model likelihood that differs between 
the current value and the proposal is n:.i £(« k ), i.e., the product of the prior 

densities for each 5^. The MH ratio is then 

K)n{v^)d{v^* \v‘^) 

where 7 r(t;^) is the density for an IG(A^;, B^) random variable. 


MH update for 9 

The update for 9 follows a completely analogous path as that for The MH update 
for 0 is a random walk conducted on the log scale, i.e., log( 6 '*) = log( 6 ' + e) for a deviate 
e ~ A^(0,s^). A value of = 0.005 was used to achieve an acceptance rate of ~40%. 
The MH ratio is 


nf.i C 't)ir(»)<i(«" I e) 


MH update for k 


The update for k also uses random walk conducted on the log scale. Let the density of 


6 



the proposal, given the current value of k be denoted d{K* \ k). The MH ratio is then 


MH = 


:0:i:K*)TT{K*)d{K \ K*) 
IlLi^i^s-,v‘^,e,^,K)7r{K)d{K* I k) 


where 7r{K) is the density for an Ganinia(74 K, -Bk) random variable. 


MH update for ^ 

Finally, the MH update for ^ is again a random walk conducted on the log scale. Let the 
density of the proposal, given the current value of ^ be denoted d{^* | .^). The MH ratio 
is then 


MH = 




where 7r(,^) is the density for an Gamma(H^, B^) random variable. 


I rest 


iid 


Let E = y — Xf3 — 6. Then E iV(0,(T^), and the inverse-Gamma prior on is 
conjugate, leading to the simple update. 


S T 
^ s=l t=l 
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